A high quality, industrial data set for binding affinity prediction: performance comparison in different early drug discovery scenarios

Andreas Tosstorff; Markus G Rudolph; Jason C Cole; Michael Reutlinger; Christian Kramer; Hervé Schaffhauser; Agnès Nilly; Alexander Flohr; Bernd Kuhn

doi:10.1007/s10822-022-00478-x

J Comput Aided Mol Des. 2022 Oct;36(10):753-765. doi: 10.1007/s10822-022-00478-x. Epub 2022 Sep 25.

Authors

Andreas Tosstorff¹, Markus G Rudolph², Jason C Cole³, Michael Reutlinger², Christian Kramer², Hervé Schaffhauser², Agnès Nilly², Alexander Flohr², Bernd Kuhn²

Affiliations

¹ Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, CH-4070, Basel, Switzerland. a.tosstorff@gmail.com.
² Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, CH-4070, Basel, Switzerland.
³ Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, UK.

PMID: 36153472
DOI: 10.1007/s10822-022-00478-x

Abstract

We release a new, high-quality data set of 1162 PDE10A inhibitors with experimentally determined binding affinities, together with 77 PDE10A X-ray co-crystal structures, from a Roche legacy project. This data set is used to compare the performance of different 2D and 3D machine learning (ML) methods as well as empirical scoring functions for high-throughput binding affinity prediction. We simulate use cases that are relevant in the lead optimization phase of early drug discovery. ML methods perform well at interpolation but poorly in extrapolation scenarios, which are most relevant to real-world applications. Moreover, we find that investing in the docking workflow for binding pose generation using multi-template docking is rewarded with improved scoring performance. A combination of 2D-ML and 3D scoring using a modified piecewise linear potential shows the best overall performance, combining information on the protein environment with learning from existing SAR data.
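
The distinction between interpolation and extrapolation scenarios can be illustrated with a minimal sketch. The abstract does not specify how the splits were defined, so everything below is an assumption for illustration only: the record fields (`id`, `series`, `pic50`), the toy values, and the two helper functions are hypothetical and are not taken from the released data set or the authors' workflow. A random split stands in for interpolation, and holding out an entire (hypothetical) chemical series stands in for extrapolation.

```python
import random


def random_split(records, test_fraction=0.2, seed=0):
    """Interpolation-style split: test compounds are drawn at random,
    so close analogues of each test compound usually remain in training."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def series_split(records, held_out_series):
    """Extrapolation-style split: one whole group of compounds is held out,
    so the model never sees that group during training."""
    train = [r for r in records if r["series"] != held_out_series]
    test = [r for r in records if r["series"] == held_out_series]
    return train, test


# Toy records with made-up values (hypothetical field names and affinities).
records = [
    {"id": f"cpd{i}", "series": "A" if i < 6 else "B", "pic50": 6.0 + 0.1 * i}
    for i in range(10)
]

interp_train, interp_test = random_split(records)
extrap_train, extrap_test = series_split(records, held_out_series="B")

print(len(interp_train), len(interp_test))
print(len(extrap_train), len(extrap_test))
```

Evaluating the same model on both kinds of test set gives a rough sense of how much of its apparent accuracy depends on having near neighbours in the training data.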

Keywords: Docking; Drug design; Lead optimization; Machine learning.

MeSH terms

Drug Discovery*
Ligands
Machine Learning
Molecular Docking Simulation
Protein Binding
Proteins* / chemistry

Substances

Ligands
Proteins

Associated data

figshare/10.6084/m9.figshare.20055014